Biodiversity Data at your Fingertips

New England Natural History Conference 2023

Michael T. Hallworth, Ph.D.

Introduction to Vermont Atlas of Life

In 2013, the Vermont Center for Ecostudies launched the Vermont Atlas of Life (VAL) to gather data on species in Vermont and begin to fill knowledge gaps for Vermont’s flora and fauna. The Vermont Atlas of Life is a central library of primary biodiversity data and accumulated knowledge from the past and present. At VAL’s core is the community of people contributing and using information about the changing nature of Vermont: occurrence records, monitoring data, distribution maps, photographs, and other data, free of charge to anyone—from backyard naturalists to scientists to policymakers. In short, VAL is one of the most ambitious and far-reaching biodiversity informatics projects Vermont has ever undertaken. From this information, we can begin to better understand what’s here, what’s not, and how biodiversity and species distributions change over time.

The Vermont Atlas of Life has data for nearly 12,000 species across Vermont from 7.7 million occurrence records derived from museum specimens, photographs, and observations by biologists, naturalists, and community scientists.

The Global Biodiversity Information Facility

The Vermont Atlas of Life joins others across the globe in curating primary occurrence records at the Global Biodiversity Information Facility (GBIF), an international network funded by the world’s governments and aimed at providing anyone, anywhere, open access to biodiversity data.

The GBIF network draws many different data streams such as iNaturalist.org, eBird.org, e-Butterfly.org, OdonataCentral.org and more together by using a data standard known as Darwin Core, which allows occurrence and biodiversity data from many different data publishers to work together seamlessly.
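
To make the standard concrete, here is a hedged, made-up example of what a single occurrence record can look like when expressed with a few common Darwin Core terms (the values are invented for illustration):

```csv
occurrenceID,scientificName,eventDate,decimalLatitude,decimalLongitude,countryCode,basisOfRecord
urn:example:obs-0001,Anax junius,2022-06-15,44.26,-72.58,US,HumanObservation
```

Because every publisher maps its records to these same terms, GBIF can combine observations from eBird, iNaturalist, museum collections, and more into one searchable index.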

In this workshop you’ll learn several ways to access primary occurrence records, whether for research projects, lesson plans for students, species checklists, or simply to satisfy your own curiosity.

Vermont Atlas of Life’s Data Explorer

On your own: Explore the different features of VAL’s Data Explorer.

Understanding the Data

Darwin Core Data Standard

The key to working with data housed at GBIF is understanding Darwin Core (DwC) terms. At first the data can be daunting because there are 250+ fields associated with every occurrence record. The occurrence record fields are listed in the table below. Some fields are required for an observation, while others are optional and may not have data for every observation. For term definitions, see DwC terms or the full list of DwC terms.

DwC term
OBJECTID
gbifID
abstract
accessRights
accrualMethod
accrualPeriodicity
accrualPolicy
alternative
audience
available
bibliographicCitation
conformsTo
contributor
coverage
created
creator
date
dateAccepted
dateCopyrighted
dateSubmitted
description
educationLevel
extent
format
hasFormat
hasPart
hasVersion
identifier
instructionalMethod
isFormatOf
isPartOf
isReferencedBy
isReplacedBy
isRequiredBy
isVersionOf
issued
language
license
mediator
medium
modified
provenance
publisher
references
relation
replaces
requires
rights
rightsHolder
source
spatial
subject
tableOfContents
temporal
title
type
valid
institutionID
collectionID
datasetID
institutionCode
collectionCode
datasetName
ownerInstitutionCode
basisOfRecord
informationWithheld
dataGeneralizations
dynamicProperties
occurrenceID
catalogNumber
recordNumber
recordedBy
individualCount
organismQuantity
organismQuantityType
sex
lifeStage
reproductiveCondition
behavior
establishmentMeans
occurrenceStatus
preparations
disposition
associatedReferences
associatedSequences
associatedTaxa
otherCatalogNumbers
occurrenceRemarks
organismID
organismName
organismScope
associatedOccurrences
associatedOrganisms
previousIdentifications
organismRemarks
materialSampleID
eventID
parentEventID
fieldNumber
eventDate
eventTime
startDayOfYear
endDayOfYear
year
month
day
verbatimEventDate
habitat
samplingProtocol
samplingEffort
sampleSizeValue
sampleSizeUnit
fieldNotes
eventRemarks
locationID
higherGeographyID
higherGeography
continent
waterBody
islandGroup
island
countryCode
stateProvince
county
municipality
locality
verbatimLocality
verbatimElevation
verbatimDepth
minimumDistanceAboveSurfaceInMeters
maximumDistanceAboveSurfaceInMeters
locationAccordingTo
locationRemarks
decimalLatitude
decimalLongitude
coordinateUncertaintyInMeters
coordinatePrecision
pointRadiusSpatialFit
verbatimCoordinateSystem
verbatimSRS
footprintWKT
footprintSRS
footprintSpatialFit
georeferencedBy
georeferencedDate
georeferenceProtocol
georeferenceSources
georeferenceVerificationStatus
georeferenceRemarks
geologicalContextID
earliestEonOrLowestEonothem
latestEonOrHighestEonothem
earliestEraOrLowestErathem
latestEraOrHighestErathem
earliestPeriodOrLowestSystem
latestPeriodOrHighestSystem
earliestEpochOrLowestSeries
latestEpochOrHighestSeries
earliestAgeOrLowestStage
latestAgeOrHighestStage
lowestBiostratigraphicZone
highestBiostratigraphicZone
lithostratigraphicTerms
group
formation
member
bed
identificationID
identificationQualifier
typeStatus
identifiedBy
dateIdentified
identificationReferences
identificationVerificationStatus
identificationRemarks
taxonID
scientificNameID
acceptedNameUsageID
parentNameUsageID
originalNameUsageID
nameAccordingToID
namePublishedInID
taxonConceptID
scientificName
acceptedNameUsage
parentNameUsage
originalNameUsage
nameAccordingTo
namePublishedIn
namePublishedInYear
higherClassification
kingdom
phylum
class
order
family
genus
subgenus
specificEpithet
infraspecificEpithet
taxonRank
verbatimTaxonRank
vernacularName
nomenclaturalCode
taxonomicStatus
nomenclaturalStatus
taxonRemarks
datasetKey
publishingCountry
lastInterpreted
elevation
elevationAccuracy
depth
depthAccuracy
distanceAboveSurface
distanceAboveSurfaceAccuracy
issue
mediaType
hasCoordinate
hasGeospatialIssues
taxonKey
acceptedTaxonKey
kingdomKey
phylumKey
classKey
orderKey
familyKey
genusKey
subgenusKey
speciesKey
species
genericName
acceptedScientificName
verbatimScientificName
typifiedName
protocol
lastParsed
lastCrawled
repatriated
relativeOrganismQuantity
recordedByID
identifiedByID
level0Gid
level0Name
level1Gid
level1Name
level2Gid
level2Name
level3Gid
level3Name
iucnRedListCategory

GBIF’s API (Application Programming Interface)

Understanding the data is one thing; being able to access and harness it is another. GBIF provides access to data via several avenues. You can download data directly through GBIF using the occurrence search tool. We won’t cover that in this workshop, but it’s one way to get the data. One huge benefit of downloading data directly from GBIF is that you get a DOI (digital object identifier) for the dataset you download. The DOI vastly improves reproducibility and allows users to find and use the same data set as previous research. One downside is that the data files are often very large: the data set of 114,000,000 occurrences for New England is >36 gigabytes. Downloading can also get tedious if you require up-to-date data.

Accessing the data through GBIF’s API is fast and flexible. However, it can be a little overwhelming if you’ve never used an API service before. The API service has its limits, but it is powerful once you learn how to ask for the data you want.

Keep this reference guide to the GBIF API handy. You will want to refer back to it often throughout the workshop and beyond.

Interacting with GBIF API

Have a look at the different parameters that can be queried using GBIF’s API.


HELP! I’ve never used an API before. Where do I even start?

Using an API is a way to access a database over the internet. With GBIF’s API, you submit a query request through your browser; GBIF’s database processes the request and returns the data you asked for. Below are a few tips on how to submit an API request to GBIF and make sense of the results.

First, we need to create the API request. Let’s work through a basic example to illustrate how we can do that. Below are the steps to query the API.

  1. Identify relevant API search parameters
  2. Use your internet browser to send the data request
  3. Interpret the results
  4. Troubleshoot request if nothing is returned
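
The steps above can also be carried out programmatically. As a sketch in base R (which we’ll use later in the workshop), the request URL for step 2 can be assembled with paste0 and URLencode; the species name here is just an example:

```r
# Base endpoint for GBIF occurrence searches
base_url <- "https://api.gbif.org/v1/occurrence/search"

# The parameter value we want to filter on (step 1)
species <- "Anax junius"

# URLencode() replaces the space with %20 so the URL is valid (step 2)
query_url <- paste0(base_url, "?scientificName=", URLencode(species))

query_url
## [1] "https://api.gbif.org/v1/occurrence/search?scientificName=Anax%20junius"
```

Paste this URL into your browser to send the request; the response that comes back is what we interpret in steps 3 and 4.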

On your own: Take a few minutes and try to determine how many observations of Common Green Darner (Anax junius) are in GBIF. Expand the code below to see one way to query the API for the data we want.


# Common green darner 
# scientificName = "Anax junius"

# To search occurrences
# base API call: https://api.gbif.org/v1/occurrence/search?

# parameter to search on: scientificName 
# we want all observations with scientificName="Anax junius"

https://api.gbif.org/v1/occurrence/search?scientificName=Anax%20junius

#count: 31,938 (as of 4/19/2023)

Now that you’ve seen how to submit a basic query to GBIF’s API - let’s look at the output.

The API call returns a few different things that we need to pay attention to. First, in the browser there are several options for how to ‘view’ the results. Consider the image below.

You can see the API query that we submitted in the browser. The API returned some results based on our query. The API returns a few fields that we need to understand before moving forward.

offset: 0
limit: 20
endOfRecords: false
count: some number
results: […]
facets: []

Before we talk about offset, limit and endOfRecords - let’s first focus on the count and results fields.

count: This is how many observations satisfy your API query - i.e., the number of records/occurrences
results: The data associated with each observation record. This is where all the information stored in DwC format that accompanies an occurrence record is returned. Note: not all 31,938 records are returned!
limit: The maximum number of observations returned per request. Defaults to 20; the maximum allowed is 300.
offset: This tells the API where to start returning results. If you want observations 301-600, for example, the offset would need to be set at 300. To get observations 1001-1300, an offset of 1000 should be used. We’ll talk more about this later in the workshop.
endOfRecords: A true or false value. If the records returned have reached the end of the records in GBIF, true is returned - otherwise the value is false.
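
To keep the offset arithmetic straight, here is a tiny hypothetical helper (page_params is my own name, not part of GBIF or any package) that converts a 1-indexed record range into the limit and offset the API expects:

```r
# Hypothetical helper: compute API paging values for a 1-indexed record range
page_params <- function(first, last) {
  list(offset = first - 1,          # API offsets are 0-indexed
       limit  = last - first + 1)   # number of records requested (max 300)
}

# records 301-600  -> offset 300,  limit 300
page_params(301, 600)

# records 1001-1300 -> offset 1000, limit 300
page_params(1001, 1300)
```

So records 301-600 need offset=300 and limit=300, matching the rule described above.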

How many observations are returned in the results section by default?


On your own: How would you alter the API query to return 300 observations? How would you alter the API query to get observations 25,001-25,300?


Once you’ve given it a try - see one way to satisfy those requirements by expanding the code.

# Get 300 observations 
https://api.gbif.org/v1/occurrence/search?scientificName=Anax%20junius&limit=300

# Get observations 25,001-25,300 
https://api.gbif.org/v1/occurrence/search?scientificName=Anax%20junius&limit=300&offset=25000

Accessing the data via a browser is a great first step into using APIs and a good way to troubleshoot complex API queries. However, accessing the data in a program where we can summarize, analyze, and map it is far more powerful. Next, we’ll take what we’ve learned so far and use it to get community science observations into R, a free, open-source statistical computing software.

Using R to query GBIF

The benefit of using a programming language like R is that you can retrieve data, analyze it, and create maps and reports all in the same environment. In addition, the scripts you write to retrieve and analyze data document the steps used to produce the final products. That documentation makes your project more likely to be reproducible, and it also allows you to make changes to the workflow or redo analyses more easily.

To save time during the workshop, I’ve set up a virtual computing environment that has R installed and all the required packages to complete the exercises and execute code. I highly encourage participants to use the virtual machine during the workshop. Unfortunately, there is no time during the workshop to help participants install R and obtain the required packages.

The rgbif package

An R package is a set of functions written in R to complete a task. Once a package is loaded into R, those functions are available to the user. There is an R package written specifically to interface with GBIF’s API. We’ll work in R for the remainder of the workshop and use what we’ve learned about GBIF data and GBIF’s API to obtain and analyze community science data. For reference, R code looks like the following in the workshop materials. The code can be copied or written directly into the virtual computing environment and run.

# Anything following a pound sign / hashtag / number symbol is not interpreted by R 
# and is handy for writing notes / commenting R code for future reference 

# Hello world
print("Hello World!")
## [1] "Hello World!"

We started the workshop by introducing Darwin Core and GBIF’s API because the workshop materials that follow rely on those fields to get data. For example, most of the rgbif functions use that information.

We’ll only scratch the surface of the rgbif package here; like all packages, it has limitations in what it can do, but it’s quite powerful and has lots of different functions we can use. The documentation for the package’s functions can be found here.

Let’s get the package loaded into our R environment so we can use the functions by executing the following code.

# load rgbif package in R
library(rgbif)

Now that we have access to rgbif functions, let’s use R to query GBIF’s API for Common Green Darner (Anax junius) like we did above in the browser. First, we need to determine which function to use to get the data we want. Recall that we were looking for the total number of Common Green Darner observations in GBIF.

# Use the scientificName method to query API #
# set limit = 20 to match browser default 
# NOTE - the package sets limit=500 by default 
#        and it queries the API twice to get 500 results 

GreenDarnerObs <- occ_search(scientificName = "Anax junius",
                             limit = 20)

# Show similar results as API request - 
# print metadata to the R console
GreenDarnerObs$meta
## $offset
## [1] 0
## 
## $limit
## [1] 20
## 
## $endOfRecords
## [1] FALSE
## 
## $count
## [1] 31940

The nice thing about the occ_search function is that it returns the data as well as the counts. Let’s first see what the structure (str) of the returned object looks like.

# Inspect what's in the data 
str(GreenDarnerObs,1)
## List of 5
##  $ meta     :List of 4
##  $ hierarchy:List of 1
##  $ data     : tibble [20 x 90] (S3: tbl_df/tbl/data.frame)
##  $ media    :List of 20
##  $ facets   : Named list()
##  - attr(*, "class")= chr "gbif"
##  - attr(*, "args")=List of 4
##  - attr(*, "type")= chr "single"

You can see there are several things that are returned.
meta: The metadata; similar to what’s returned via the browser
hierarchy: The biological hierarchy of the organism (kingdom, phylum, class, order, ..., species and the associated taxonKey)
data: The DwC fields associated with the observations. Most of our workshop will deal with these fields.
media: Details about any media (mostly photos) submitted with an observation
facets: Internal calculations from the API that we won’t use here
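
If list indexing in R is new to you, here is a small mock object with the same top-level shape as an occ_search() result (the values below are made up; a real result comes from rgbif, not this sketch). Each element is reached with $, just as we did with GreenDarnerObs$meta:

```r
# A tiny mock with the same top-level shape as an occ_search() result
# (values are invented for illustration only)
mock_result <- list(
  meta      = list(offset = 0, limit = 20, endOfRecords = FALSE, count = 31940),
  hierarchy = list(),
  data      = data.frame(scientificName   = "Anax junius",
                         decimalLatitude  = 44.26,
                         decimalLongitude = -72.58),
  media     = list(),
  facets    = list()
)

# Drill into nested elements with $
mock_result$meta$count        # total matching records
## [1] 31940

nrow(mock_result$data)        # rows of DwC data returned
## [1] 1
```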

Let’s have a quick look at the data returned

GreenDarnerObs$data
## # A tibble: 20 x 90
##    key        scientificName  decimalLatitude decimalLongitude issues datasetKey
##    <chr>      <chr>                     <dbl>            <dbl> <chr>  <chr>     
##  1 4020039301 Anax junius Dr~            26.2            -97.7 cdc,g~ 3ab13bbb-~
##  2 4015431302 Anax junius Dr~            27.2           -109.  cdc,g~ 3ab13bbb-~
##  3 4011663229 Anax junius Dr~            27.3            -80.5 cdc,c~ 50c9509d-~
##  4 4014914876 Anax junius Dr~            21.3           -158.  cdc,c~ 50c9509d-~
##  5 4014939084 Anax junius Dr~            28.1            -80.6 cdc,c~ 50c9509d-~
##  6 4014905236 Anax junius Dr~            21.3           -158.  cdc,c~ 50c9509d-~
##  7 4015004240 Anax junius Dr~            25.4            -80.7 cdc,c~ 50c9509d-~
##  8 4015372322 Anax junius Dr~            29.6            -95.4 cdc,c~ 50c9509d-~
##  9 4015178339 Anax junius Dr~            29.6            -95.4 cdc,c~ 50c9509d-~
## 10 4015020345 Anax junius Dr~            29.6            -95.4 cdc,c~ 50c9509d-~
## 11 4015044527 Anax junius Dr~            30.4            -90.1 cdc,c~ 50c9509d-~
## 12 4018366307 Anax junius Dr~            21.3           -158.  cdc,c~ 50c9509d-~
## 13 4018373377 Anax junius Dr~            30.3            -97.7 cdc,c~ 50c9509d-~
## 14 4022228856 Anax junius Dr~            30.2            -97.6 cdc,c~ 50c9509d-~
## 15 4022032253 Anax junius Dr~            27.5            -97.9 cdc,c~ 50c9509d-~
## 16 4022134399 Anax junius Dr~            29.6            -95.1 cdc,c~ 50c9509d-~
## 17 4022334640 Anax junius Dr~            30.7            -97.6 cdc,c~ 50c9509d-~
## 18 4022235684 Anax junius Dr~            28.4            -82.7 cdc    50c9509d-~
## 19 4028827520 Anax junius Dr~            29.5            -81.2 cdc,c~ 50c9509d-~
## 20 4029039746 Anax junius Dr~            28.6            -80.8 cdc,c~ 50c9509d-~
## # ... with 84 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
## #   lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
## #   basisOfRecord <chr>, individualCount <int>, occurrenceStatus <chr>,
## #   lifeStage <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, ...


On your own: How would you alter the R code to return 300 observations? How would you alter the code to get observations 25,001-25,300?

Hint: there is no ‘offset’ argument in the occ_search function. Consult the documentation for the package.


# Get 300 observations 
GreenDarnerObs <- occ_search(scientificName = "Anax junius",
                             limit = 300)

# Get observations 25,001-25,300 
GreenDarnerObs <- occ_search(scientificName = "Anax junius",
                             limit = 300,
                             start = 25000)


On your own: Using the rgbif package, query GBIF to find all observations of your favorite taxon that occurred in 2022.

Consult the documentation for the package to determine how to specify a year range in your query


GreenDarnerObs_2022 <- occ_search(scientificName = "Anax junius",
                             limit = 300,
                             start = 0,
                             year = 2022)

Now that we have some experience querying data from GBIF using the API, let’s grab all the observations of Common Green Darner. Below is the code for getting all 31,940 observations. This can take a little while depending on the number of observations. Here, I’ve included the resulting data as part of the workshop materials on the virtual computing machine, but I’ve also included the code below for you to alter for your own projects.

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
# 
# THIS CODE IS HERE FOR ILLUSTRATIVE PURPOSES ONLY
# PLEASE DO NOT EXECUTE THIS CODE DURING THE WORKSHOP
# The following code takes approx 25 mins to execute 
#
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 

# Since we know the total number of observations (31,940) we can 
# iterate through the API until all observations are acquired 

# we know the limit of observations per query
limit <- 300 

# set up the values to start at for each query
# NOTE: API offsets are 0-indexed, so the first query starts at 0
starts <- seq(from = 0, to = 32000, by = limit)

# determine the number of queries needed
nQueries <- length(starts)

# Create a list to store data temporarily 
all_obs <- vector('list', nQueries)

# GBIF returns only data fields that have values on that particular page. 
# This can lead to issues downstream. Here, I provide a gbif data template
# so that all the data can be merged seamlessly later. 

gbif_data_template <- read.csv("../gbif_data_template.csv")
  
# Use the information above to get all the data 
# - time the process
a <- Sys.time()

# loop through all the different starting values 
# when i = 1, start will equal 0
# when i = 2, start will equal 300
# when i = 3, start will equal 600
# ..
# when i = nQueries, start will equal 31800

for(i in 1:nQueries){

  # start with a fresh copy of the data template to fill
  empty_template <- gbif_data_template

  obs <- occ_search(scientificName = "Anax junius",
                    limit = limit,
                    start = starts[i])

  # find the intersecting fields - i.e., keep only the fields that we want
  these <- intersect(names(gbif_data_template), names(obs$data))

  # copy the returned data into the template
  empty_template[1:nrow(obs$data), these] <- obs$data[1:nrow(obs$data), these]

  # store the filled template in the list
  all_obs[[i]] <- empty_template
}

# see how long the process takes
Sys.time()-a

# combine all the data in the list into a single object
CGD_all_obs <- do.call(rbind,all_obs)

# write data to file 
# saveRDS(CGD_all_obs, "CommonGreenDarner_all_obs.rds")

Using and analyzing the data

Now that we have some data, let’s summarize the data in different ways that may be of interest.

Read in the full Common Green Darner data set and have a quick look at what’s in it using the str function in R. That prints the structure of an object.

# read in the data 
CGD_obs <- readRDS("../CommonGreenDarner_all_obs.rds")

# have a quick look at the data #
str(CGD_obs,1)
## 'data.frame':    32100 obs. of  10 variables:
##  $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ OBJECTID          : logi  NA NA NA NA NA NA ...
##  $ gbifID            : chr  "4015431302" "4011663229" "4014914876" "4014939084" ...
##  $ abstract          : logi  NA NA NA NA NA NA ...
##  $ accessRights      : chr  "Public, non commercial use only" NA NA NA ...
##  $ accrualMethod     : logi  NA NA NA NA NA NA ...
##  $ accrualPeriodicity: logi  NA NA NA NA NA NA ...
##  $ accrualPolicy     : logi  NA NA NA NA NA NA ...
##  $ alternative       : logi  NA NA NA NA NA NA ...
##  $ audience          : logi  NA NA NA NA NA NA ...

Phenology

Let’s find out when during the year Common Green Darners are observed the most. We can do that using the month field in the data set. Our goal is to count the number of observations in each month of the year and plot those values. That will help users visualize when the species is most commonly observed and most active. To do this we can use a function called table. The table function counts the number of times each value is found within a column. Since there is a month field for every observation with a known date, we can use the table function to get the information we’re interested in.

# NOTE: the ( ) around the entire argument below tells R to print the object after 
# it's created

(month_obs <- table(CGD_obs$month))
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  164  163  852 2151 2117 3364 4285 5772 6902 3593 1074  349

We can see from the data that most observations are submitted in September (month = 9). We can create a simple plot showing these data.

# Create a simple barplot of the data
barplot(month_obs)
barplot(month_obs,
        ylab = "Number of Observations",
        xaxt = "n",
        xlab = "Month",
        las = 1)
axis(1, at = seq(0.7,13.9,1.2),labels = month.abb)

Let’s take a look at the first observation in Vermont for each of the last 10 years. There are many ways to complete that task in R. One way is illustrated below.

# Subset the data and keep only the observations that occur within Vermont 
# here we use the 'stateProvince' field in the data 
# we keep only the rows that satisfy that query and keep all the columns 
# the R syntax follows as such data[row,column]

VTdata <- CGD_obs[CGD_obs$stateProvince=="Vermont",]

# Once we have VTdata we need to select all observations that were observed 
# in 2013 or later 

VTdata_decade <- VTdata[VTdata$year>= 2013,]

# Now that we have the data we want we can find the earliest observation
# date in each year. We can do that using the tapply function. The tapply
# function applies a function to all values of X (eventDate) to each unique 
# value in INDEX (year). The function (FUN) we're applying here is the minimum.

minVT <- tapply(X = VTdata_decade$eventDate,
                INDEX = list(VTdata_decade$year),
                FUN = min, na.rm = TRUE)
##                  2013                  2014                  2015 
## "2013-06-14T00:00:00" "2014-06-01T03:13:00" "2015-05-10T00:00:00" 
##                  2016                  2017                  2018 
## "2016-05-12T00:00:00" "2017-05-16T00:00:00" "2018-05-09T13:33:00" 
##                  2019                  2020                  2021 
## "2019-05-20T10:34:00" "2020-06-03T17:26:00" "2021-05-19T09:52:00" 
##                  2022 
## "2022-05-18T00:00:00"

On your own: Determine which New England state had the first observation in 2022.

More challenging: Determine the first observation date for adult males in every state or province.

Additional challenge: Determine how many observations occur in each state or province.

# A solution for the first problem 
NEdata <- CGD_obs[CGD_obs$stateProvince %in% c("Vermont","New Hampshire","Rhode Island",
                                               "Connecticut","Maine","Massachusetts"),]

NE_2022 <- NEdata[NEdata$year == 2022,]


minNE <- tapply(X = NE_2022$eventDate,
                INDEX = list(NE_2022$stateProvince),
                FUN = min, na.rm = TRUE)


# Solution for the second # 
AdultMale_data <- CGD_obs[CGD_obs$sex=="MALE" & CGD_obs$lifeStage=="Adult",]

first_male <- tapply(AdultMale_data$eventDate,
                     list(AdultMale_data$stateProvince),
                     min,  na.rm = TRUE)

# Solution for the additional challenge - observations in each state or province
table(CGD_obs$stateProvince)
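
To take that tally one step further and pinpoint the state with the most observations, sort the table or use which.max. A toy example with made-up states shows the pattern:

```r
# Toy example (made-up states) showing the "most observations" pattern
state <- c("Vermont", "Maine", "Vermont", "Vermont", "Maine", "New Hampshire")
state_counts <- table(state)

# sort the tally so the biggest count comes first...
sort(state_counts, decreasing = TRUE)

# ...or pull out just the top state
names(which.max(state_counts))
## [1] "Vermont"
```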

Explore spatial data

We don’t have time to go over how to use R as a GIS in this workshop, but there are lots of great resources online to get you started. I have a similar workshop called Get Spatial! Using R as GIS. It’s a little dated now, but it follows a format similar to this one.

Now that we have all the data for Common Green Darner, let’s see where the observations occur. To do that, we’ll first need to convert the data into a spatial layer. We’ll need to load another R package called sf to do that. Details about the functions found in the sf package can be found here. Many people use R as a GIS, and there are a lot of great online resources if you want to get started using R for spatial analyses. In this workshop, we’ll do some basic spatial operations without getting into the details.

# Load the simple features packages (sf)
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE

Now that we have access to all the sf functions, we can convert our data into spatial data. FYI: all sf functions start with st_. In the code below we will convert CGD_obs into spatially explicit data using the st_as_sf function. We need to provide the data we want to convert (x) and tell the function where the geographic coordinates (coords) are stored in the data. In GBIF, the coordinates are stored in the decimalLongitude and decimalLatitude columns. Refer to the DwC terms if needed. The coordinate reference system (crs) is standard WGS84 (unprojected longitude/latitude). Each projection has an EPSG number; WGS84 = 4326. Note that the coordinates need to be supplied in the correct order (longitude [east/west], then latitude [north/south]).

Darner_spatial <- st_as_sf(x = CGD_obs,
                           coords = c("decimalLongitude",
                                      "decimalLatitude"),
                           crs = 4326)
## Error in st_as_sf.data.frame(x = CGD_obs, coords = c("decimalLongitude", : missing values in coordinates not allowed

We received an error message. It tells us that NA (empty) values are not allowed in the coordinates - it’s simply not possible to make a spatial layer from observations that don’t have coordinates. Let’s remove the observations without coordinates and try again.

# Remove observations with empty coordinates 
# Keep all observations that are not (!) NA

CGD_obs_coords <- CGD_obs[!is.na(CGD_obs$decimalLongitude),]

Darner_spatial <- st_as_sf(x = CGD_obs_coords,
                           coords = c("decimalLongitude",
                                      "decimalLatitude"),
                           crs = 4326)

Our new spatial object Darner_spatial has a new column called geometry. Let’s take a quick peek at the differences between our two objects (CGD_obs and Darner_spatial).

# CGD_obs (non-spatial)
head(CGD_obs)

# Darner_spatial (spatial)
Darner_spatial

Let’s make a simple plot of the observations to determine where they were observed.

# To plot just the locations we need to specify "geometry"
# note the $ sign means get the "geometry" column in the Darner_spatial object
plot(Darner_spatial$geometry,
     main = "Common Green Darner Observations in GBIF", # title of plot
     pch = 19,   # filled in dots
     cex = 0.5)  # size of dots

Calculate a simple distribution

We can see from the figure above that Common Green Darner has been observed in Europe and Asia. Let’s assume for this analysis that we’re only interested in observations that occur in North America. The code below will keep only the observations that occurred within North America.

# Get a tally of all observations in different continents
table(Darner_spatial$continent)
## 
##        AFRICA          ASIA        EUROPE NORTH_AMERICA       OCEANIA 
##             1           186            34         30558           100
# Keep all the rows where the column continent = "NORTH_AMERICA"
NorthAm_CGD <- Darner_spatial[Darner_spatial$continent == "NORTH_AMERICA",]

plot(NorthAm_CGD$geometry)

Looking good! Now that we have just the North American observations, we can calculate a simple range for the species. To do this we’ll use a minimum convex polygon (st_convex_hull). This is a relatively crude way to estimate range size, but it’s simple to do. The function takes all the spatial points and creates a polygon around the observations. We’ll use that polygon as the species’ range.

DarnerRange <- st_convex_hull(st_union(NorthAm_CGD))
plot(NorthAm_CGD$geometry, col = "gray60", pch = 19)
plot(DarnerRange, add = TRUE, lwd = 2, border = "black")